NATURAL LANGUAGE DATA BASE QUERY: Using the data base itself

as  the  definition  of  world  knowledge 
and  as  an  extension  of  the  dictionary 

Larry  R.  Harris 
Mathematics Department
Dartmouth  College 
Hanover,  New  Hampshire  03755 

Abstract 

This paper raises two issues that have not been dealt with in
any previous natural language data base query system. These
issues arise because of the ever-present need for world knowledge
in the understanding of English, and also because of the
particular way in which information is stored in a data base.
The solutions to these problems described in this paper require
only existing state of the art data base technology. Systems
based on these ideas serve as an example of a currently
attainable, yet usable, natural language query system. As such
they serve as a counterexample to the philosophy that natural
language processing is an all or nothing situation.


I Introduction


Natural  language  data  base  query  has  long  been  recognized  as 
a useful  application  of  AI  techniques.  The  state  of  the  art  in 
both  natural  language  processing  and  in  data  base  management 
systems  (DBMS)  has  already  reached  the  point  where  the  two  could 
be  married  to  provide  a useful  access  medium  for  untrained  users. 

What has impeded the useful application of the most
successful natural language systems, such as Winograd(72) and
Woods(72)? Most people agree that the performance level is high
enough, as shown by the LUNAR system's success at the Geology
conference. Why, then, have the techniques not been successfully
utilized?

Basically  the  answer  reduces  to  economics.  The  cost  of 
running  any  of  these  systems  is  too  high.  The  cost  of  a computer 
big enough to run them is too high. Most important, the startup
cost of applying these programs to a new area of discourse is too
high. These costs are only offset by the unsubstantiated claims
of higher user efficiency in a natural language environment. As
yet, no one has chosen to pay the price.

Beyond these questions of cost effectiveness lurk other
problems that hinder the successful application of past systems.
Basically these problems are related to the size of today's
existing data bases. Because natural language query systems such
as Woods' and Petrick's impose a partition between the
understanding and the retrieval functions, they face potentially
insurmountable problems when applied to large data bases.

Existing  research  in  knowledge  representation  is  an  attempt 
to  merge  these  two  functions.  That  is,  the  data  base  and  the 
internal  structures  used  by  the  parser  would  be  one  and  the  same. 
We  keenly  await  the  results  of  this  research.  An  alternate 
approach  is  to  more  closely  couple  the  understanding  function  with 
the  data  base  by  utilizing  the  high  performance  data  base 
technology  that  exists  today. 

The  basic  thesis  of  this  paper  is  that  it  is  wholly 
infeasible  to  design  natural  language  query  systems  that  do  not 
make  use  of  the  data  base  itself  as  a definition  of  world 
knowledge  and  as  an  extension  of  the  dictionary.  In  Section  II  we 
develop  this  argument  fully.  Section  III  presents  a proposed 
solution,  along  with  a brief  discussion  of  why  the  solution  is 
feasible  in  terms  of  existing  DBMS  technology.  Section  IV 
discusses  the  impact  of  this  on  the  basic  design  of  a real 
system.


II The Problem.

Assume  for  a moment  that  all  of  the  problems  involved  with 
natural  language  understanding  were  solved  by  an  extension  of  any 
of  the  current  approaches.  Consider  what  problems  would  be 
encountered  in  applying  this  newly  developed  system  to  the 
environment  of  data  base  query. 

First, we must note that all the systems we have discussed
make use of an auxiliary dictionary that typically contains the
root form of words, their syntactic category, and other useful bits
of information such as special suffixes. All the existing
natural language systems expect that all words that will appear in
a sentence can be found, in one form or another, in this
dictionary.

Now  we  can  see  some  problems  beginning  to  arise.  Typical 
data  bases  involve  thousands  or  even  millions  of  English  words. 
If  we  are  forced  to  include  all  of  this  in  our  linguistic 
dictionary, then we would nearly duplicate the data base.
Furthermore, we must realize that real data bases are rarely
stagnant; they can change daily or in some cases continually.
Updates to the data base would require corresponding changes to be
made to the dictionary. Furthermore, these changes in the
dictionary are not always trivial to make, since they involve
enumerating each word's syntactic category and all of its lexical
and word sense ambiguities. In the best case this could be done
by any competent computational linguist who had a working
knowledge of the program. In the worst case actual programming
changes may have to be made to use the new word correctly. In any
case, it should be noted that this is not a task that could easily
be automated or performed by an ordinary programmer.

A common reaction to this dilemma is to solve it by entering
the unabridged dictionary once and for all, feeling that this will
solve the problem. Not true. Much of what the data base contains
is a limited form of world knowledge. Often data bases are about
people, places, and things. Thus, they often deal with proper
names and composite names, hitting just at the unabridged
dictionary's weakest point. You won't find much in the
unabridged dictionary about proper names like "Albert Mahoney" or
"South Podunk Falls". Also, composite groups such as "skymist
blue" or "executive secretary" won't be found. Yet all of these
are very likely to appear in some data base, and thus very likely
to appear in some query regarding that data base.

These  last  examples  bring  up  another  separate,  but  related 
problem.  Since  the  natural  language  processor  must  at  some  point 
relate  the  query  to  the  actual  data  base,  some  "meaning"  of  these 
words must also be given in the dictionary. For example,
answering the question "Who is in Dubuque?" requires knowing that
Dubuque is a city and thus may appear in the city field of the
data base. "How many skymist blue cars were sold in 1975?"
requires knowing that "skymist blue" is a color as opposed to a
manufacturer or a body type. It should be emphasized that a pure
syntactic parse indicating that "skymist" modifies "blue" which
modifies "car" is insufficient to formulate a search to the data
base. We must somehow have access to the fact that "skymist blue"
may appear in the color field.

This looks like an opportune time to claim that general world
knowledge will solve the problem. After all, if having a list of
all the cities in the world isn't world knowledge, what is? But
that is exactly what would be required to solve the problem this
way: a list of cities, a list of colors, a list of names, etc.,
etc.

If this is beginning to sound like another data base, you're
wrong.  It's  beginning  to  sound  like  the  same  data  base  to  which 
the  queries  are  directed.  The  answer  is  clear.  The  data  base 
itself  must  be  used  as  both  an  extension  of  the  dictionary  and  as 
a definition of world knowledge.


III A Solution.


Exactly what does it mean to say that the data base is the
definition of world knowledge? It becomes more clear when you ask
someone how they understand the queries "What cars are green?" and
"What cars are Fords?" In one case you search the color field; in
the other you search the manufacturer field. But how did you know
to do this? You called upon your world knowledge to identify
"green" as a color and "Ford" as a manufacturer.

Many people jump to the conclusion that to use world
knowledge to solve this means to have access to all the possible
colors and manufacturers of cars. But this ignores the fact that
people's world knowledge is not always complete. For example, you
might have trouble responding to "Which ones are taupe?" if you
didn't know that "taupe" is indeed a color. Furthermore, you
probably bias the reading of "Are any of them Fords?" to thinking
of cars when it makes perfect sense as a person's name.

To  say  that  we  use  the  data  base  as  a definition  of  world 
knowledge  is  to  say,  for  example,  that  if  a word  appears  in  the 
color  field  then  it  is  a color,  and  if  it  does  not  appear  in  the 
color  field  then  it  is  not  a color.  Similarly  we  define  cities, 
states  and  names,  in  fact  everything  in  the  data  base. 
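
A minimal sketch of this closed-world definition follows. The
records, the field names, and the helper `is_a` are hypothetical
and chosen only for illustration; a real system would put the
question to the DBMS rather than scan an in-memory list.

```python
# Hypothetical miniature CARS file; the records and field names are
# illustrative assumptions, not data from any real system.
CARS = [
    {"color": "GREEN", "manufacturer": "FORD", "owner": "SMITH"},
    {"color": "TAUPE", "manufacturer": "VOLVO", "owner": "GREEN"},
]

def is_a(word, field, records=CARS):
    """A word is a <field> exactly when it occurs in that field."""
    word = word.upper()
    return any(rec[field] == word for rec in records)

print(is_a("taupe", "color"))    # occurs in the color field
print(is_a("purple", "color"))   # absent, so by definition not a color
```

Under this definition "purple" is simply not a color as far as
this data base is concerned, which is the incompleteness the text
accepts.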

You may argue that green is a color whether or not any cars
are colored green. This is of course true, but the same argument
holds for the color taupe. We hope to obtain a reasonable level
of competence with an incomplete definition of world knowledge,
just as those of you who just learned that taupe is a color have
so aptly demonstrated is possible.

It might be thought that defining world knowledge in this way
will make it impossible to respond to questions like "How many
green cars are there?" when the word "green" does not appear in
the data base. In fact this is not a problem, and all questions
of this type can be answered without fear of misinterpretation
since the answer is clearly "none" no matter what "green" means.

However, it is possible to misinterpret a question like
"Which of them are green?" when "green" does not appear in the
color  field  but  does  appear  in  another  field,  such  as  the  name 
field.  Assuming  that  the  user  was  asking  about  green  colored 
cars,  we  would  erroneously  generate  a query  about  people  named 
Green.  The  earlier  example  about  Ford  illustrates  that  people  are 
likely  to  make  the  same  kind  of  error.  By  echoing  back  our 
interpretation  of  the  query,  as  is  done  in  the  sample  dialog,  the 
user  can  see  if  any  such  misunderstanding  takes  place. 

In the case where "green" appears in both the color and name
fields, we simply ask the user which was meant, unless the syntax
of the sentence gives further clues. For example, "Which are
green?" would require an interaction with the user, whereas "Which
are colored green?" or "Which are green in color?" would not.

We  argued  earlier  that  we  must  use  the  data  base  as  an 
extension  of  the  dictionary  as  well  as  a definition  of  world 
knowledge.  We  now  discuss  exactly  what  this  means  and  how  these 
two  uses  are  distinct.  By  treating  the  data  base  as  an  extension 
of  the  dictionary  we  are  saying  that  we  would  like  to  be  able  to 
perform  the  same  operations  on  the  data  base  as  we  do  on  the 
dictionary.  Furthermore,  we  would  like  to  extract  the  same 
information  from  the  data  base  that  we  extract  from  the 
dictionary.

First,  let  us  take  up  this  issue  of  what  operations  we 
perform  on  the  dictionary.  Primarily  here  I am  speaking  of 
morphology,  stripping  words  down  to  their  root  form.  To 
understand the sentences "Who are the secretaries?" and "Who has a
secretarial job?" requires the ability to figure out how these two
forms of "secretary" appear in the actual data base. We must be
able  to  do  this  by  performing  operations  on  the  data  base  much 
like  we  would  perform  on  the  dictionary. 
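
One way to picture morphology driven by the data base is sketched
below. The suffix rules and the contents of the JOB field are
assumptions made for this illustration only; the paper does not
specify the actual morphological rules used.

```python
# Hypothetical suffix rules: (suffix to strip, replacement). A real
# morphology component would be far more complete.
SUFFIX_RULES = [("ies", "y"), ("ial", "y"), ("es", ""), ("s", "")]

# Assumed contents of the JOB field, standing in for a DBMS lookup.
JOB_VALUES = {"SECRETARY", "MANAGER", "VICE PRESIDENT"}

def candidate_roots(word):
    """Yield the word itself, then each form produced by one suffix rule."""
    word = word.upper()
    yield word
    for suffix, replacement in SUFFIX_RULES:
        suffix = suffix.upper()
        if word.endswith(suffix) and len(word) > len(suffix):
            yield word[: -len(suffix)] + replacement.upper()

def root_in_data_base(word, values=JOB_VALUES):
    """Return the first candidate root stored in the data base, or None."""
    for root in candidate_roots(word):
        if root in values:
            return root
    return None

print(root_in_data_base("secretaries"))
print(root_in_data_base("secretarial"))
```

Both surface forms reduce to the stored value SECRETARY, so either
query can be mapped onto the same field contents.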

There  is  one  further  use  of  the  dictionary  that  must  be 
performed  in  the  data  base,  that  of  forming  composite  groups.  For 
example  the  proper  noun  "New  York"  is  best  thought  of  as  one  word 
that  happens  to  have  a blank  in  it.  However,  we  need  to  ascertain 
this fact by looking in the data base or else we might parse it as
"New" modifying "York". In a sentence like "Who is the New York
area manager?" we must determine how the composites are formed
without the luxury of having any of the last four words in the
dictionary. Furthermore, the composites could be "New York" "area
manager" or "New York area" "manager" depending on exactly how
they were stored in the data base. In cases like this the data
base itself is clearly the preferable place to dynamically extract
such information, since it may vary with time.
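
The dependence of composite grouping on the stored values can be
sketched as follows. The greedy longest-match strategy and the
function name `group_composites` are assumptions of this sketch;
the paper does not say which grouping strategy is used.

```python
def group_composites(words, stored_values):
    """Greedily group adjacent words into the longest phrase found in
    stored_values; words that match nothing are left as single tokens."""
    words = [w.upper() for w in words]
    groups, i = [], 0
    while i < len(words):
        # Try the longest span starting at position i first.
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in stored_values or j == i + 1:
                groups.append(phrase)
                i = j
                break
    return groups

words = "New York area manager".split()
print(group_composites(words, {"NEW YORK", "AREA MANAGER"}))
print(group_composites(words, {"NEW YORK AREA", "MANAGER"}))
```

The same four words are grouped differently depending only on what
the data base happens to contain, which is why the grouping cannot
be fixed in advance in a dictionary.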

In order for the data base to be an extension of the
dictionary we must also be able to extract the same information
from it that we can from the dictionary. For example, one item we
expect from the dictionary is the syntactic category of a word.
We could, of course, store such information in the data base along
with the actual words. This goes along with the rather obvious
strategy of making the dictionary an actual file under the control
of the data base management system. However, this does not avoid
the issue of how these facts are entered in the data base,
particularly as the data base is dynamically updated. In any case
this is a massive effort, as well as a significant perturbation of
the existing data base. It would be far more practical to at
least attempt to leave the data base intact and try to work around
the problem. In this way we can hope to interface directly to an
existing data base, without changing it in any way. Clearly this
is a desirable goal if it can be achieved.

But how can we hope to parse a sentence without knowing the
syntactic category of every word in the sentence? This is a very
unusual  situation,  one  in  which  we  do  know  the  semantic  use  of  the 
word,  namely  how  and  where  it  appears  in  the  data  base,  but  not 
its  syntactic  use.  If  we  could  only  use  a parser  that  was 
forgiving  enough  to  allow  the  use  of  such  words  in  a sentence, 
building  only  the  syntactic  structure  for  what  it  recognized,  then 
later  when  high  level  semantic  analysis  begins,  we  might  be  able 
to  merge  the  semantic  knowledge  about  these  words  into  the  overall 
semantic  structure  of  the  sentence.  For  example  the  sentence 
"Print  the  names  and  phone  numbers  of  all  of  the  secretaries" 
might  generate  the  following  incomplete  semantic  structure. 


(FILE (EMPLOYEE))
(PRINT (NAME PHONE))
(SEARCH (UNKNOWN-FIELD = SECRETARIES))


This  could  be  merged  with  the  semantic  knowledge  extracted 
from  the  data  base  that  "SECRETARY"  actually  appears  in  the  JOB 
field.  This  would  form  the  following  complete  semantic  structure 
suitable  for  initiating  a full  query  to  the  data  base. 

(FILE (EMPLOYEE))
(PRINT (NAME PHONE))
(SEARCH (JOB = SECRETARY))
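
The merge step can be sketched as below. The dictionary-based
representation, the table `FIELDS_OF`, and the function
`fill_unknown_field` are hypothetical stand-ins for the internal
structures, which the paper shows only in list notation.

```python
# Assumed result of asking the data base where each root word occurs.
FIELDS_OF = {"SECRETARY": ["JOB"]}

def fill_unknown_field(structure, root, fields_of=FIELDS_OF):
    """Return one completed copy of structure per field containing root."""
    completed = []
    for field in fields_of.get(root, []):
        filled = dict(structure)          # shallow copy; original untouched
        filled["SEARCH"] = (field, "=", root)
        completed.append(filled)
    return completed

incomplete = {
    "FILE": "EMPLOYEE",
    "PRINT": ["NAME", "PHONE"],
    "SEARCH": ("UNKNOWN-FIELD", "=", "SECRETARIES"),
}
complete = fill_unknown_field(incomplete, "SECRETARY")
print(complete[0]["SEARCH"])
```

A word that occurs in several fields would yield several completed
structures, one per field, to be resolved later.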

Is  it  feasible  to  drive  a commercial  DBMS  in  this  way?  Can 
we  afford  to  dynamically  find  all  the  fields  a given  word  appears 
in? It turns out that depending on the design philosophy of the
DBMS it may be quite feasible to perform these operations. The
sample  dialogue  that  follows  demonstrates  that  this  approach  falls 
well  within  the  state  of  the  art  of  DBMS. 

Very often people conjure up images of sequential passes over
the  file  to  see  if  "green"  appears  in  any  of  the  fields.  This,  of 
course,  would  be  totally  out  of  the  question.  Access  to  the  data 
is  achieved  in  a number  of  different  ways  in  all  of  the  major 
DBMSs.  Thus,  the  data  base  designer  has  the  choice  of  using  hash 
code  techniques,  data  inversion,  network  structures,  or  relational 
structures.  Of  these,  the  mechanism  most  suited  for  the  type  of 
access  we  require  is  data  inversion.  It  turns  out  that  the 
questions  about  the  existence  of  a word  in  the  data  base  can  be 
answered by primitive operations on an inversion index; thus they
are extremely efficient. In this sense we are using the inversion
index  as  an  associative  store. 

Before  specifying  exactly  what  data  inversion  is,  let  me 
preface  the  discussion  with  the  fact  that  the  natural  language 
analysis  couldn't  care  less  how  the  answers  are  obtained.  It  is 
certainly not dependent on inversion. Any DBMS techniques that
exist now or in the future that can answer these kinds of
questions in an acceptable time frame are acceptable.

The  best  example  of  an  inverted  data  base  is  the  index  at  the 
end of a book. If it were fully inverted, every word used in the
book would appear (only once) in the index along with a set
of pointers (page numbers) to where the word appears. Given such
a book and such an index, how hard is it to tell if the word
"aardvark" appeared in the book? Quite easy, since the index is
alphabetized. In fact, in order to answer that question you
could throw away the list of pointers since you don't need to know
the page numbers it appears on. In fact, you could throw away the
book itself, since the answer is wholly contained in the index.

To  invert  an  entire  data  base,  you  simply  treat  each  field  as 
a new  book  and  invert  each  field.  Thus,  you  could  poll  the  fields 
by  asking  if  a given  word  is  in  the  index  of  any  field.  In  this 
way  you  find  out  whether  the  word  appears  at  all,  and  if  so  what 
fields  it  appears  in. 
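
Field-by-field inversion and polling can be sketched as follows.
The record layout and the names `invert` and `fields_containing`
are assumptions of this sketch; a commercial DBMS keeps its
inversion tables on disk in its own format.

```python
from collections import defaultdict

def invert(records):
    """Build one inversion table per field: value -> record numbers."""
    tables = defaultdict(lambda: defaultdict(list))
    for number, record in enumerate(records):
        for field, value in record.items():
            tables[field][value].append(number)
    return tables

def fields_containing(word, tables):
    """Poll every field's index; no record is read to answer this."""
    return sorted(field for field, index in tables.items() if word in index)

records = [
    {"COLOR": "GREEN", "MANUFACTURER": "FORD", "NAME": "SMITH"},
    {"COLOR": "BLUE", "MANUFACTURER": "VOLVO", "NAME": "GREEN"},
]
tables = invert(records)
print(fields_containing("GREEN", tables))
print(fields_containing("TAUPE", tables))
```

An empty result says the word does not appear at all; more than
one field name signals an ambiguity to be resolved.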

Conceptually  you  could  take  this  collection  of  indices  and 
treat  it  as  another  book  and  invert  it,  creating  a higher  level 
index,  that  for  any  word  in  the  data  base  would  give  a list  of 
pointers  to  the  inversion  tables  in  which  that  word  appears.  In 
such  a case  the  answer  to  our  question  is  merely  a single  search 
in an alphabetized list. To my knowledge none of the commercial
DBMSs maintains this higher level index structure.
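
Such a higher level index might be sketched as below. This is an
illustration only: the function `build_word_index` is hypothetical,
and a Python dictionary stands in for the alphabetized list that
the text describes.

```python
def build_word_index(tables):
    """Invert the per-field indices: word -> sorted list of field names."""
    word_index = {}
    for field, index in tables.items():
        for word in index:
            word_index.setdefault(word, []).append(field)
    return {word: sorted(fields) for word, fields in word_index.items()}

# Assumed per-field inversion tables (value -> record numbers).
tables = {
    "COLOR": {"GREEN": [0], "BLUE": [1]},
    "NAME": {"SMITH": [0], "GREEN": [1]},
}
word_index = build_word_index(tables)
print(word_index.get("GREEN", []))
```

With this structure the question "which fields contain this word?"
becomes one lookup instead of one lookup per field.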

These  secondary  indices  are  created  for  efficient  searching 
and sorting of the records in the file. In fact, arbitrary search,
union, and intersection as well as sorting can be defined as
operations on the inversion tables themselves so that no data need
ever be retrieved from the file until it is known to satisfy the
search criteria and be in the desired order. It is for this
reason that the inverted indices are created and maintained by the
DBMS as new data is entered. We are merely making use of an
already existing structure within the DBMS, and making use of it
in a straightforward manner.
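
Search logic carried out on the inversion tables alone can be
sketched as follows. The tables, record numbers, and function
names here are assumptions for illustration; sets of record
numbers stand in for the pointer lists a DBMS would maintain.

```python
# Assumed inversion tables: field -> value -> set of record numbers.
TABLES = {
    "COLOR": {"GREEN": {0, 2, 5}, "BLUE": {1}},
    "MANUFACTURER": {"FORD": {0, 1, 5}, "VOLVO": {2}},
}

def hits(field, value, tables=TABLES):
    """Record numbers where field = value, read from the index alone."""
    return tables.get(field, {}).get(value, set())

def search_and(*terms):
    """Intersect the pointer sets of (field, value) terms."""
    sets = [hits(field, value) for field, value in terms]
    return set.intersection(*sets) if sets else set()

def search_or(*terms):
    """Union the pointer sets of (field, value) terms."""
    return set().union(*(hits(field, value) for field, value in terms))

green_fords = search_and(("COLOR", "GREEN"), ("MANUFACTURER", "FORD"))
print(sorted(green_fords))
```

Only the record numbers that survive the set operations would ever
need to be fetched from the file.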

Tests  performed  on  a 10  million  byte  data  base  indicate  that 
response  time  is  well  under  5 seconds  real  time  even  when  more 
than 20 calls of this type are made to the DBMS.


IV A Complete Methodology

In  this  section  we  discuss  how  a system  might  make  use  of 
a data  base  in  this  way.  The  basic  distinction  of  this  approach 
is  that  we  will  make  several  calls  to  the  DBMS  while  trying  to 
understand  the  sentence,  as  well  as  the  one  final  call  to  actually 
retrieve the answers. Most other query facilities try to limit
themselves to this one final call.

Basically the order of the processing is as follows. The
query is broken down into individual words, each of which is
looked up in the dictionary and, if not there, in the data base.
Morphology  is  automatically  performed  during  each  of  these 
searches.  In  the  cases  where  individual  words  are  not  found, 
composite  groups  are  formed  as  they  appear  in  the  data  base. 

At  this  point  syntactic  analysis  begins,  potentially  building 
several  incomplete  interpretations.  The  holes  within  these 
semantic  structures  can  now  be  merged  with  knowledge  gained  from 
asking  the  data  base  about  individual  words,  as  illustrated 
earlier.  Only  one  task  remains,  namely  selecting  the  one 
interpretation  that  was  intended  by  the  user  from  the  set  of  well 
formed  interpretations  that  still  remains.  Once  again  we  turn  to 
the data base to aid in the resolution of this problem.

As an example of this situation consider the following
sentence.

"TELL  ME  ABOUT  GREEN  FORD  CARS." 

Assume  that  we  have  generated  interpretations  based  on  the  fact 
that  "GREEN"  appears  in  both  the  color  and  name  fields,  and  "FORD" 
appears  in  the  name  and  manufacturer  fields.  Thus  the  four  search 
terms  would  be 

1. (AND (COLOR = GREEN) (MANUFACTURER = FORD))
2. (AND (COLOR = GREEN) (NAME = FORD))
3. (AND (NAME = GREEN) (MANUFACTURER = FORD))
4. (AND (NAME = GREEN) (NAME = FORD))
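
The four search terms are the cross product of the fields in which
each word occurs. A minimal sketch, assuming the field lists have
already been obtained from the data base (the table `FIELDS_OF`
and the function `interpretations` are hypothetical):

```python
from itertools import product

# Assumed answers from the data base: the fields each word occurs in.
FIELDS_OF = {
    "GREEN": ["COLOR", "NAME"],
    "FORD": ["MANUFACTURER", "NAME"],
}

def interpretations(words, fields_of=FIELDS_OF):
    """Return every way of assigning each word to one of its fields."""
    choices = [[(field, word) for field in fields_of[word]] for word in words]
    return [list(combo) for combo in product(*choices)]

for number, terms in enumerate(interpretations(["GREEN", "FORD"]), start=1):
    print(number, terms)
```

The count grows multiplicatively with the number of ambiguous
words, which is one reason richer syntax in the query is helpful.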

It  should  be  pointed  out  that  if  the  user  had  chosen  a richer 
syntactic  expression  of  his  query,  most  of  these  interpretations 
would  not  have  been  generated.  However,  let  us  take  the  example 
exactly  as  given. 

How  can  we  use  the  data  base  to  help  select  the  intended 
interpretation?  We  simply  push  the  idea  of  using  the  data  base  as 
a definition  of  world  knowledge  a little  further.  By  applying 
all  four  search  expressions  to  the  data  base  we  can  get  a reading 
on  how  meaningful  each  query  is,  given  the  current  state  of  the 
data  base.  We  should  clearly  make  note  of  the  fact  that  this  is 
not  in  any  sense  a reading  on  how  likely  it  is  that  the  user 
intended this interpretation. But by determining whether
each search term has zero or positive hits in the data base we
can employ the following very useful heuristic. If there is
exactly one interpretation with positive hits we select that
interpretation as the one intended by the user, with an
appropriate echo of the interpretation to act as a warning. If
there are several positive hit interpretations we must ask the
user which of these he intended, as the heuristic is of no help in
these cases. Finally, if there are only zero hit interpretations
we can safely answer "none" or "no" as appropriate.

Thus, for our example, assuming that only interpretation
number one had positive hits, this interpretation would be
selected, and the echo that we were searching for green colored
cars made by Ford would be printed.
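
The three-way rule can be sketched as below. The function
`choose_interpretation`, the action labels, and the hit counts are
hypothetical; in a real system the counts would come from the DBMS.

```python
def choose_interpretation(candidates, count_hits):
    """Apply the zero / one / several positive-hit rule.

    candidates: list of interpretations; count_hits: a caller-supplied
    function returning the number of matching records for one of them.
    """
    positive = [c for c in candidates if count_hits(c) > 0]
    if not positive:
        return ("answer-none", None)      # safe to answer "none" or "no"
    if len(positive) == 1:
        return ("echo", positive[0])      # select it and echo as a warning
    return ("ask-user", positive)         # heuristic cannot decide

# Hypothetical hit counts for four numbered interpretations.
HITS = {1: 12, 2: 0, 3: 0, 4: 0}
print(choose_interpretation([1, 2, 3, 4], HITS.get))
```

Note that the rule only needs to know whether each count is zero
or positive, not the records themselves.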

This  heuristic,  which  may  seem  rather  tenuous  at  first,  is 
based  on  the  premise  that  people  will  tend  to  ask  questions  about 
things  that  are  in  the  data  base.  In  terms  of  the  data  base 
defining  world  knowledge,  we  are  ascertaining  which  of  the 
interpretations  do  not  make  sense  with  respect  to  the  known  state 
of the world. This heuristic has yet to fail, to my knowledge, in
any real user session. It is, however, very easy to
conjure up situations in which it would fail, which is exactly why
it is labelled a heuristic. The cost of such failure is small
indeed considering that the user is warned of the
misinterpretation and can always rephrase the sentence using more
syntactic clues to indicate the desired meaning.

The  notion  of  asking  the  DBMS  whether  there  are  any  hits  on  a 
given  interpretation  is  not  particularly  expensive.  People  often 
imagine  that  each  search  requires  making  a pass  over  the  data  base 
retrieving each record to see if it meets the search criteria.
However, by making use of the inversion tables the DBMS can
perform the search logic on the index and immediately tell whether
any records satisfy it or not. Thus the question can be answered
without retrieving a single record, meaning that it is basically an
in-core operation and thus quite fast.

The techniques discussed in th[...]
process a large subset of the questions [...] ask.
To best exemplify this level of competence, [...]
questions are given, all of which can [...]
techniques described herein.

The questions pertain [...]
about employees and cars. [...]
the data base as a definit[...]
the words "Ford" nor "green" [...]
interpretation of these que[...] dependent [...]
contents of the data base. [...]
the color field, or both. [...]
of the data base as an ext[...]
phrases, "Vice President" [...]
dictionary and therefore must be pieced t[...] remai[...]
sentences illustrate the [...]
that can currently be dealt with. [...]
use of pronouns and sentence [...]

WHATS IN THE EMPLOYEE FILE.

WHAT FIELDS ARE IN THE FILE OF CARS?

WHICH CARS ARE FORDS?

WHICH OF THOSE ARE GREEN?

PRINT A MILEAGE REPORT BY MANUFACTURER FOR '71 VOLVOS AND PORSCHES
INCLUDING THEIR MODEL AND COLOR.

LIST THE PHONE NUMBERS OF THE SINGLE WOMEN IN ST. LOUIS.

GIVE ME A SALARY HISTOGRAM FOR THEM.

GIVE ME A SORTED LIST OF NAMES OF ALL THE VICE PRESIDENTS
IN CHICAGO OR LOS ANGELES.

ARE THERE ANY PEOPLE WORKING AS SECRETARIES THAT EARN A SALARY
OF $5,000 OR MORE?

BROKEN DOWN BY MANUFACTURER, PRINT A LIST OF ALL THE '70 GREEN CARS
WITH OVER 50,000 MILES ON THEM.

FIND THE CARS MADE BY PORSCHE AND MADE IN '71.

BROKEN DOWN BY JOB, REPORT ON THEIR NAME, SALARY, AND PHONE.

SALARY OF EMPLOYEES EARNING > $40,000.